relative positional encoding
Structure-informed Positional Encoding for Music Generation
Agarwal, Manvi, Wang, Changhong, Richard, Gaël
Music generated by deep learning methods often suffers from a lack of coherence and long-term organization. Yet, multi-scale hierarchical structure is a distinctive feature of music signals. To leverage this information, we propose a structure-informed positional encoding framework for music generation with Transformers. We design three variants in terms of absolute, relative and non-stationary positional information. We comprehensively test them on two symbolic music generation tasks: next-timestep prediction and accompaniment generation. For comparison, we choose multiple baselines from the literature and demonstrate the merits of our methods using several musically motivated evaluation metrics. In particular, our methods improve the melodic and structural consistency of the generated pieces.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
HyPE: Attention with Hyperbolic Biases for Relative Positional Encoding
In Transformer-based architectures, the attention mechanism is inherently permutation-invariant with respect to the input sequence's tokens. To impose sequential order, token positions are typically encoded using a scheme with either fixed or learnable parameters. We introduce Hyperbolic Positional Encoding (HyPE), a novel method that uses the properties of hyperbolic functions to encode tokens' relative positions. This approach biases the attention mechanism without needing to store the $O(L^2)$ values of a mask, with $L$ being the length of the input sequence. HyPE relies on preliminary concatenation operations and matrix multiplications, encoding relative distances indirectly by incorporating biases into the softmax computation. This design ensures compatibility with FlashAttention-2 and supports gradient backpropagation for any learnable parameters within the encoding. We analytically demonstrate that, with careful hyperparameter selection, HyPE can approximate the attention bias of ALiBi, thereby offering promising generalization to contexts longer than those encountered during pretraining. The experimental evaluation of HyPE is proposed as a direction for future research.
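A plausible reading of the concatenation trick described above is the hyperbolic addition identity cosh(x - y) = cosh(x)cosh(y) - sinh(x)sinh(y): appending a pair of hyperbolic features to each query and key makes their dot product pick up a term that depends only on the offset $i - j$, so no $L \times L$ bias mask is ever materialized. The NumPy sketch below illustrates that identity only; the function name and the hyperparameters `a` and `c` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def append_hyperbolic_features(q, k, a=0.01, c=-1.0):
    """Concatenate hyperbolic features so that q_i . k_j gains an extra term
    c * cosh(a * (i - j)): a bias depending only on the relative offset,
    via cosh(x - y) = cosh(x)cosh(y) - sinh(x)sinh(y).
    `a` and `c` are illustrative hyperparameters, not taken from the paper."""
    n = q.shape[0]
    pos = a * np.arange(n)
    q_ext = np.stack([c * np.cosh(pos), -c * np.sinh(pos)], axis=1)
    k_ext = np.stack([np.cosh(pos), np.sinh(pos)], axis=1)
    return np.concatenate([q, q_ext], axis=1), np.concatenate([k, k_ext], axis=1)

# Check: augmented logits = plain logits + c * cosh(a * (i - j)).
rng = np.random.default_rng(0)
q, k = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
qh, kh = append_hyperbolic_features(q, k)
i, j = np.arange(8)[:, None], np.arange(8)[None, :]
assert np.allclose(qh @ kh.T, q @ k.T - np.cosh(0.01 * (i - j)))
```

With $c < 0$ the added term penalizes large $|i - j|$, which is how a decaying, ALiBi-like bias can emerge from plain matrix multiplications.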
Modeling Time-Series and Spatial Data for Recommendations and Other Applications
With the research directions described in this thesis, we seek to address the critical challenges in designing recommender systems that can understand the dynamics of continuous-time event sequences (CTES). We follow a ground-up approach: first, we address the problems that may arise from poor-quality CTES data being fed into a recommender system; later, we handle the task of designing accurate recommender systems. To improve the quality of the CTES data, we address the fundamental problem of overcoming missing events in temporal sequences. Moreover, to provide accurate sequence-modeling frameworks, we design solutions for points-of-interest (POI) recommendation, i.e., models that can handle users' spatial mobility data across POI check-ins and recommend candidate locations for the next check-in. Lastly, we highlight that the capabilities of the proposed models have applications beyond recommender systems, and we extend them to large-scale CTES retrieval and human activity prediction. A significant part of this thesis models the underlying distribution of CTES via neural marked temporal point processes (MTPP). Traditional MTPP models are stochastic processes that use a fixed formulation to capture the generative mechanism of a sequence of discrete events localized in continuous time. In contrast, neural MTPP models combine the underlying ideas from the point-process literature with modern deep learning architectures. The ability of deep learning models to act as accurate function approximators has led to a significant gain in the predictive prowess of neural MTPP models. In this thesis, we utilize and present several neural-network-based enhancements to current MTPP frameworks for the aforementioned real-world applications.
- Research Report > Promising Solution (1.00)
- Overview (1.00)
- Research Report > New Finding (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
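As a concrete instance of the "fixed formulation" that the thesis abstract above attributes to traditional MTPP models, the classical Hawkes process writes the event intensity in closed form; the minimal sketch below is textbook material, not code from the thesis.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Classical Hawkes intensity, a fixed parametric MTPP formulation:
    lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i)).
    Every past event excites the process, with influence decaying over time;
    a neural MTPP replaces this closed form with a learned function of history."""
    past = np.asarray(history)
    past = past[past < t]
    return mu + np.sum(alpha * np.exp(-beta * (t - past)))

events = [1.0, 1.1, 1.3]
print(hawkes_intensity(1.4, events))  # elevated: three recent events
print(hawkes_intensity(9.0, events))  # decayed back toward the base rate mu
```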
Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP
Wang, Zhicai, Hao, Yanbin, Gao, Xingyu, Zhang, Hao, Wang, Shuo, Mu, Tingting, He, Xiangnan
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks and have become the main competitor of CNNs and vision Transformers. They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers. However, the heavily parameterized token-mixing layers naturally lack mechanisms to capture local information and multi-granular non-local relations, so their discriminative power is restrained. To tackle this issue, we propose a new positional spatial gating unit (PoSGU). It exploits the attention formulations used in classical relative positional encoding (RPE) to efficiently encode cross-token relations for token mixing, and it reduces the quadratic parameter complexity $O(N^2)$ of vision MLPs to $O(N)$ or even $O(1)$. We experiment with two RPE mechanisms and further propose a group-wise extension that improves their expressive power by providing multi-granular contexts. These then serve as the key building blocks of a new type of vision MLP, referred to as PosMLP. We evaluate the effectiveness of the proposed approach through thorough experiments, demonstrating improved or comparable performance at reduced parameter complexity. For instance, for a model trained on ImageNet1K, we improve accuracy from 72.14% to 74.02% while reducing the number of learnable parameters from 19.4M to 18.2M. Code can be found at https://github.com/Zhicaiwww/PosMLP.
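One way to picture how a relative-position parameterization cuts a token-mixing layer from $O(N^2)$ to $O(N)$ parameters: expand the $N \times N$ mixing matrix from a single learnable vector with one weight per offset, as classical RPE does. The NumPy sketch below shows that expansion only and is my illustration; the actual PoSGU adds the gating and group-wise extensions described in the paper.

```python
import numpy as np

def relative_token_mixing(x, rel_w):
    """Token mixing with O(N) parameters instead of a dense O(N^2) matrix.
    x:     (N, C) token features.
    rel_w: (2N - 1,) learnable weights, one per relative offset; the full
           N x N mixing matrix W[i, j] = rel_w[i - j + N - 1] is expanded
           on the fly and is Toeplitz, as in classical RPE."""
    n = x.shape[0]
    idx = np.arange(n)[:, None] - np.arange(n)[None, :] + n - 1
    return rel_w[idx] @ x  # (N, N) matrix with only 2N - 1 distinct entries

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))           # 16 tokens, 8 channels
rel_w = rng.normal(size=2 * 16 - 1)    # 31 parameters instead of 16**2 = 256
out = relative_token_mixing(x, rel_w)  # (16, 8)
```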
Relative Positional Encoding
In this post, we will take a look at relative positional encoding, as introduced in Shaw et al. (2018) and refined by Huang et al. (2018). This is a topic I meant to explore earlier, but only recently was I able to force myself to dive into it, as I started reading about music generation with NLP language models. That is a separate topic for another post of its own, so let's not get distracted. Let's dive right in! If you're already familiar with transformers, you probably know that they process all input tokens in parallel.
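For readers who want the formula behind the post, the relative-key variant of Shaw et al. (2018) adds a learned embedding per clipped relative offset to each key before the dot product. A minimal NumPy sketch (names and shapes are mine, chosen for illustration):

```python
import numpy as np

def shaw_rpe_logits(q, k, rel_emb, max_dist=4):
    """Attention logits with Shaw et al. (2018)-style relative positions
    (relative-key variant): e[i, j] = q_i . (k_j + a[clip(j - i)]) / sqrt(d),
    where rel_emb holds one learned vector per clipped relative offset."""
    n, d = q.shape
    # Map offsets j - i to indices 0 .. 2 * max_dist after clipping.
    offsets = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None],
                      -max_dist, max_dist) + max_dist
    a = rel_emb[offsets]  # (n, n, d): the relative-position term per (i, j)
    return (q @ k.T + np.einsum('id,ijd->ij', q, a)) / np.sqrt(d)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(10, 16)), rng.normal(size=(10, 16))
rel_emb = rng.normal(size=(2 * 4 + 1, 16))  # 9 embeddings for offsets -4 .. 4
logits = shaw_rpe_logits(q, k, rel_emb)     # (10, 10)
```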
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding
Luo, Shengjie, Li, Shanda, Cai, Tianle, He, Di, Peng, Dinglan, Zheng, Shuxin, Ke, Guolin, Wang, Liwei, Liu, Tie-Yan
The attention module, a crucial component of the Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since relative positional encoding is used by default in many state-of-the-art models, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of kernelized attention. Based on the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than the standard Transformer in the long-sequence regime.
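The FFT step the abstract relies on is a standard primitive: a Toeplitz matrix built from $2n - 1$ relative-position terms can be embedded in a circulant and applied to the value pathway in $\mathcal{O}(n \log n)$. A self-contained NumPy sketch of that primitive follows; the interface and names are illustrative, not the authors' code.

```python
import numpy as np

def toeplitz_matvec_fft(rel_bias, v):
    """Multiply the Toeplitz matrix T[i, j] = rel_bias[(i - j) + n - 1] by v
    in O(n log n) via a circulant embedding and the FFT.
    rel_bias: (2n - 1,) values for offsets -(n-1) .. n-1.
    v:        (n, d) array (e.g. the value pathway of kernelized attention)."""
    n = v.shape[0]
    # First column of the circulant embedding: offsets 0..n-1, then -(n-1)..-1.
    col = np.concatenate([rel_bias[n - 1:], rel_bias[:n - 1]])
    v_pad = np.concatenate([v, np.zeros((n - 1, v.shape[1]))], axis=0)
    # Circulant matvec = iFFT(FFT(column) * FFT(input)), column-wise over d.
    out = np.fft.ifft(np.fft.fft(col)[:, None] * np.fft.fft(v_pad, axis=0), axis=0)
    return out[:n].real

# Agrees with the dense O(n^2) product.
rng = np.random.default_rng(0)
n, d = 6, 3
rel, v = rng.normal(size=2 * n - 1), rng.normal(size=(n, d))
T = np.array([[rel[i - j + n - 1] for j in range(n)] for i in range(n)])
assert np.allclose(T @ v, toeplitz_matvec_fft(rel, v))
```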
Relative Positional Encoding for Transformers with Linear Complexity
Liutkus, Antoine, Cífka, Ondřej, Wu, Shih-Lun, Şimşekli, Umut, Yang, Yi-Hsuan, Richard, Gaël
Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers; it exploits lags instead of absolute positions for inference. Still, RPE is not available for the recent linear variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what such methods avoid. In this paper, we bridge this gap and present Stochastic Positional Encoding (SPE), a way to generate PE that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is a connection between positional encoding and the cross-covariance structure of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.46)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
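The cross-covariance construction in the Liutkus et al. abstract above can be illustrated with random Fourier features: stochastic position codes whose dot products concentrate around a chosen stationary kernel of the lag $i - j$. The sketch below targets a Gaussian kernel and is a generic illustration of the principle under that assumption, not the paper's sinSPE/convSPE implementation.

```python
import numpy as np

def stochastic_position_codes(n, num_feats=256, lengthscale=8.0, rng=None):
    """Random-Fourier-feature position codes phi(i) with
    E[phi(i) . phi(j)] = exp(-(i - j)**2 / (2 * lengthscale**2)),
    i.e. the expected Gram matrix is a stationary kernel of the lag only,
    so the codes behave like an RPE without any explicit attention matrix."""
    rng = rng or np.random.default_rng()
    pos = np.arange(n)
    omega = rng.normal(0.0, 1.0 / lengthscale, size=num_feats)  # spectral draws
    theta = rng.uniform(0.0, 2 * np.pi, size=num_feats)         # random phases
    return np.sqrt(2.0 / num_feats) * np.cos(pos[:, None] * omega + theta)

# With many features the empirical Gram matrix approaches the target kernel.
phi = stochastic_position_codes(32, num_feats=20000, rng=np.random.default_rng(0))
lags = np.arange(32)[:, None] - np.arange(32)[None, :]
target = np.exp(-lags.astype(float) ** 2 / (2 * 8.0 ** 2))
print(np.abs(phi @ phi.T - target).max())  # small, shrinking as num_feats grows
```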